Nonparametric Linear Feature Learning in Regression Through Regularisation
Representation learning plays a crucial role in automated feature selection,
particularly in the context of high-dimensional data, where non-parametric
methods often struggle. In this study, we focus on supervised learning
scenarios where the pertinent information resides within a lower-dimensional
linear subspace of the data, namely the multi-index model. If this subspace
were known, it would greatly enhance prediction, computation, and
interpretation. To address this challenge, we propose a novel method for linear
feature learning with non-parametric prediction, which simultaneously estimates
the prediction function and the linear subspace. Our approach employs empirical
risk minimisation, augmented with a penalty on function derivatives, ensuring
versatility. Leveraging the orthogonality and rotation invariance properties of
Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising
alternating minimisation, we iteratively rotate the data to improve alignment
with leading directions and accurately estimate the relevant dimension in
practical settings. We establish that our method yields a consistent estimator
of the prediction function with explicit rates. Additionally, we provide
empirical results demonstrating the performance of RegFeaL in various
experiments.
Comment: 42 pages, 5 figures
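The alternating fit-and-rotate loop is easy to caricature in code. The sketch below is a hypothetical illustration, not RegFeaL itself: a Gaussian kernel ridge predictor and finite-difference gradient outer products stand in for the Hermite expansion and derivative penalty, and all names and constants are invented for the example.

```python
import numpy as np

# Hypothetical sketch of alternating feature learning, NOT RegFeaL itself:
# fit a smooth predictor, then rotate the data so the leading directions of
# its estimated gradients align with the coordinate axes, and repeat.

def fit_predictor(X, y, lam=1e-2):
    """Gaussian kernel ridge regression; returns a callable predictor."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    alpha = np.linalg.solve(np.exp(-sq / X.shape[1]) + lam * np.eye(len(X)), y)
    return lambda Z: np.exp(
        -((Z[:, None, :] - X[None, :, :]) ** 2).sum(-1) / X.shape[1]) @ alpha

def leading_directions(f, X, eps=1e-4):
    """Eigenvectors of the averaged outer product of estimated gradients."""
    d = X.shape[1]
    G = np.stack([(f(X + eps * np.eye(d)[j]) - f(X - eps * np.eye(d)[j]))
                  / (2 * eps) for j in range(d)], axis=1)
    _, V = np.linalg.eigh(G.T @ G / len(X))
    return V[:, ::-1]  # leading directions first

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)  # 1-d index
y = np.sin(X @ w_true) + 0.1 * rng.standard_normal(200)

R = np.eye(5)
for _ in range(3):  # alternate: fit predictor, then rotate towards gradients
    f = fit_predictor(X @ R, y)
    R = R @ leading_directions(f, X @ R)
print("estimated leading direction:", np.round(R[:, 0], 2))
```

On this toy multi-index problem the first column of R should align (up to sign) with the planted direction w_true, illustrating how the rotation step exposes the relevant subspace.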
Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient Descent
A recent line of empirical studies has demonstrated that SGD might exhibit a
heavy-tailed behavior in practical settings, and the heaviness of the tails
might correlate with the overall performance. In this paper, we investigate the
emergence of such heavy tails. To our knowledge, previous works on this
problem have considered only online (also called single-pass) SGD, for which
the theoretical emergence of heavy tails is contingent upon access to an
infinite amount of data. Hence, the underlying mechanism generating the
reported heavy-tailed behavior in practical settings, where the amount of
training data is finite, is still not well-understood. Our contribution aims to
fill this gap. In particular, we show that the stationary distribution of
offline (also called multi-pass) SGD exhibits 'approximate' power-law tails and
the approximation error is controlled by how fast the empirical distribution of
the training data converges to the true underlying data distribution in the
Wasserstein metric. Our main takeaway is that, as the number of data points
increases, offline SGD will behave increasingly 'power-law-like'. To achieve
this result, we first prove nonasymptotic Wasserstein convergence bounds for
offline SGD to online SGD as the number of data points increases, which may be
of independent interest. Finally, we illustrate our theory in experiments on
synthetic data and neural networks.
Comment: In Neural Information Processing Systems (NeurIPS), Spotlight Presentation, 202
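The phenomenon is straightforward to probe numerically. Below is a hedged toy experiment in the spirit of the abstract, not the paper's actual setup: multi-pass SGD on least squares with batch size one and a deliberately large step size, followed by a textbook Hill estimator on the stationary iterates; all constants are illustrative.

```python
import numpy as np

# Toy experiment (illustrative constants): multi-pass SGD on least squares
# with a large step size, then a Hill tail-index estimate on the magnitude
# of the stationary iterates. The update is a random linear recursion, so a
# power-law-like tail can emerge even though the dataset is finite.

rng = np.random.default_rng(1)
n, eta = 1000, 0.6  # finite dataset; large step size promotes heavy tails
X = rng.standard_normal((n, 1))
y = X[:, 0] + rng.standard_normal(n)

w, tail_samples = 0.0, []
for t in range(150_000):
    i = rng.integers(n)                      # multi-pass: resample same data
    w -= eta * X[i, 0] * (X[i, 0] * w - y[i])
    if t > 30_000:                           # discard burn-in
        tail_samples.append(abs(w))

def hill_estimator(samples, k=500):
    """Hill estimate of the tail index from the k largest samples."""
    s = np.sort(samples)[-k:]
    return 1.0 / np.mean(np.log(s[1:] / s[0]))

# Smaller index = heavier tail; light-tailed iterates give a large value.
print("estimated tail index:", hill_estimator(np.array(tail_samples)))
```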
Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent
Algorithmic stability is an important notion that has proven powerful for
deriving generalization bounds for practical algorithms. The last decade has
witnessed an increasing number of stability bounds for different algorithms
applied on different classes of loss functions. While these bounds have
illuminated various properties of optimization algorithms, the analysis of each
case typically required a different proof technique with significantly
different mathematical tools. In this study, we make a novel connection between
learning theory and applied probability and introduce a unified guideline for
proving Wasserstein stability bounds for stochastic optimization algorithms. We
illustrate our approach on stochastic gradient descent (SGD) and we obtain
time-uniform stability bounds (i.e., the bound does not increase with the
number of iterations) for strongly convex losses and non-convex losses with
additive noise, where we recover similar results to the prior art or extend
them to more general cases by using a single proof technique. Our approach is
flexible and can be generalized to other popular optimizers, as it mainly
requires developing Lyapunov functions, which are often readily available in
the literature. It also illustrates that ergodicity is an important component
for obtaining time-uniform bounds, which might not be achievable for convex or
non-convex losses unless additional noise is injected into the iterates. Finally,
we slightly stretch our analysis technique and prove time-uniform bounds for
SGD under convex and non-convex losses (without additional additive noise),
which, to our knowledge, is novel.
Comment: 49 pages, NeurIPS 202
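One way to make the stability notion concrete is a synchronous coupling experiment: run the same noisy SGD trajectory on two datasets that differ in a single example and watch the iterate distance. The sketch below is an illustrative probe of that kind, assuming a strongly convex least-squares loss and shared Gaussian perturbations; it is not the paper's proof machinery.

```python
import numpy as np

# Illustrative stability probe: noisy SGD on two neighbouring datasets that
# differ in one example, coupled by sharing minibatch indices and additive
# noise. A time-uniform stability bound says the iterate distance stays
# controlled as the number of iterations grows.

rng = np.random.default_rng(2)
n, d, eta, sigma = 500, 10, 0.05, 0.01
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

X2, y2 = X.copy(), y.copy()
X2[0], y2[0] = rng.standard_normal(d), rng.standard_normal()  # swap one point

w1 = np.zeros(d)
w2 = np.zeros(d)
for t in range(20_001):
    i = rng.integers(n)                        # shared sample index
    xi = sigma * rng.standard_normal(d)        # shared additive noise
    w1 = w1 - eta * X[i] * (X[i] @ w1 - y[i]) + xi
    w2 = w2 - eta * X2[i] * (X2[i] @ w2 - y2[i]) + xi
    if t % 5_000 == 0:
        print(t, np.linalg.norm(w1 - w2))      # should not grow with t
```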
Efficient Bayesian Model Selection in PARAFAC via Stochastic Thermodynamic Integration
Parallel factor analysis (PARAFAC) is one of the most popular tensor factorization models. Even though it has proven successful in diverse application fields, the performance of PARAFAC usually hinges on the rank of the factorization, which is typically specified manually by the practitioner. In this study, we develop a novel parallel and distributed Bayesian model selection technique for rank estimation in large-scale PARAFAC models. The proposed approach integrates ideas from the emerging field of stochastic gradient Markov chain Monte Carlo, statistical physics, and distributed stochastic optimization. As opposed to existing methods, which are based on heuristics, our method has a clear mathematical interpretation and significantly lower computational requirements, thanks to data subsampling and parallelization. We provide a formal theoretical analysis of the bias induced by the proposed approach. Our experiments on synthetic and large-scale real datasets show that our method is able to find the optimal model order while being significantly faster than the state of the art.
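For readers unfamiliar with the model whose rank is being selected, the following is a minimal alternating-least-squares sketch of PARAFAC itself; the rank R, which the paper's method estimates, is simply passed in by hand here, and the Bayesian SG-MCMC machinery is not reproduced.

```python
import numpy as np

# Minimal PARAFAC/CP sketch: approximate a 3-way tensor as a sum of R
# rank-one terms, T ≈ sum_r A[:,r] ∘ B[:,r] ∘ C[:,r], via alternating least
# squares. Rank selection (the paper's contribution) is not implemented.

def unfold(T, mode):
    """Mode-n matricization of a 3-way tensor (numpy C-order convention)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product; rows ordered with B's index fastest."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def parafac_als(T, R, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, R)) for s in T.shape)
    for _ in range(n_iter):
        A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
        B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
        C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

# Recover a planted rank-3 tensor
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((s, 3)) for s in (6, 7, 8))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = parafac_als(T, R=3)
print("residual:", np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)))
```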
Generalized Sliced Wasserstein Distances
The Wasserstein distance and its variations, e.g., the sliced-Wasserstein
(SW) distance, have recently drawn attention from the machine learning
community. The SW distance, specifically, was shown to have similar properties
to the Wasserstein distance, while being much simpler to compute, and is
therefore used in various applications including generative modeling and
general supervised/unsupervised learning. In this paper, we first clarify the
mathematical connection between the SW distance and the Radon transform. We
then utilize the generalized Radon transform to define a new family of
distances for probability measures, which we call generalized
sliced-Wasserstein (GSW) distances. We also show that, similar to the SW
distance, the GSW distance can be extended to a maximum GSW (max-GSW) distance.
We then provide the conditions under which GSW and max-GSW distances are indeed
distances. Finally, we compare the numerical performance of the proposed
distances on several generative modeling tasks, including SW flows and SW
auto-encoders.
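The construction is simple to state in code. The sketch below, assuming empirical measures with equal sample sizes, computes a Monte Carlo sliced-Wasserstein distance via random directions and sorted one-dimensional projections, and swaps in an odd-polynomial defining function as an illustrative GSW-style slice; the helper names are mine, not the paper's.

```python
import numpy as np

# Sliced Wasserstein via random projections: in 1-d, the 2-Wasserstein
# distance between equal-size empirical measures reduces to comparing
# sorted samples. Replacing the linear slice x -> <theta, x> with a
# nonlinear defining function g(x, theta) gives a generalized (GSW) slice.

def w2_1d(u, v):
    """Squared 2-Wasserstein distance between 1-d empirical measures."""
    return np.mean((np.sort(u) - np.sort(v)) ** 2)

def sliced_w2(X, Y, n_proj=200, g=None, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(X.shape[1])
        theta /= np.linalg.norm(theta)
        if g is None:                 # linear slice: classical SW
            u, v = X @ theta, Y @ theta
        else:                         # nonlinear slice: GSW-style
            u, v = g(X, theta), g(Y, theta)
        total += w2_1d(u, v)
    return np.sqrt(total / n_proj)

cubic = lambda X, theta: (X ** 3) @ theta   # illustrative odd-degree slice

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 2))
Y = rng.standard_normal((500, 2)) + 1.0
print("SW :", sliced_w2(X, Y))
print("GSW:", sliced_w2(X, Y, g=cubic))
```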